Do: almost finished. Just make a couple of these. See how to hide code chunks for big ones of groups at end. See if there is a simple markdown solution to seeing dfsummary or remove the histograms.

Campus-Level Data for 2019:

This example highlights my data wrangling and visualization skills in an exploratory analysis of racial/ethnic inequity in public schools in Texas. This was a pet project that I created to satisfy my own curiosity on the topic and to get to know the publicly available data related to education in Texas. I clean, validate, and join several data sets with the end goal of exploring the relationship between race/ethnicity, economic disparity, and educational outcomes in Texas public schools. This page provides the steps required to go from several raw public data sets to produce the following interactive visualization, as well as several others, using the R packages listed below:

final example here using full code in echo = F chunk.

Load Packages

library(tidyverse)
library(readxl)
library(plotly)
library(gapminder)
library(kableExtra)
library(summarytools)

Percent of Students Meeting Grade-Level by Race/Ethnic Categories

For educational outcomes, I use the publicly available Texas Academic Performance Report (TAPR) data for the year 2019. This is a very large data set that that I have subsetted to focus on the percentage of students in three racial/ethnic categories (variable: Student Group) who meet the expectations of their respective grade levels (variable: meets_level_pct). I also include the numeric campus identifier cid and the respective denominators for each percentile value (variable: meets_denom), which informs the absolute number of students from each category at the school. I use this value to weight the observations in the Loess smoothing function and to inform the size parameter in the plots. For the original variable names, see the TAPR Codebook.

As described in the codebook for the data set, student groups’ percentile values with very small denominators are masked (coded as negative placeholder values of “-1”) in order to preclude the possibility of identification of individuals in the data. I exclude these masked values from the data set, as well as any school that reports missing data for all three student groups. Denominators of the second smallest student group for each school are also masked with a unique placeholder (“-3”) when the smallest group requires masking. I replace this set of masked values with a reasonable proxy of half the size of the next largest group in the data.

This code does the following:

  1. Read in subsetted TAPR Data
  2. Filter schools with all missing values
  3. Pivot the data to include (up to) three observations per school, corresponding to the three percentile scores meets_level_pct of students in each racial/ethnic category that meet the grade-level expectations.
  4. Create category names for student groups: Student Group and group_denom
  5. Remove masked data for percentiles and recode the masked denominator with the proxy.
  6. Create an additional variable ln_meets_denom of logged group size for use as weight and size in the plots.

The result is

campus_2019 <- read.csv("docs/TAPR_2019_subset.csv") %>%
  transmute(
    cid = CAMPUS,
    `African American` = CDB00A001219R,
    `Hispanic` = CDH00A001219R,
    `White` = CDW00A001219R,
    aa_total = CDB00A001019D,
    h_total = CDH00A001019D,
    w_total = CDW00A001019D) %>%
  # Filter out schools with no data (~ 10% of campuses)
  filter(!is.na(`African American`) | !is.na(`Hispanic`) | !is.na(`White`))

data <- cbind(
  pivot_longer(campus_2019 %>% select(1:4), cols = 2:4, names_to = "Student Group", values_to = "Meets Grade Level (%)"),
  pivot_longer(campus_2019 %>% select(5:7), cols = 1:3, names_to = "group_denom", values_to = "Size of Student Group at School")) %>%
  group_by(cid) %>%
  mutate(
    `Meets Grade Level (%)` = ifelse(`Meets Grade Level (%)` == -1, NA, `Meets Grade Level (%)`), # rate = -1 is masked data and unusable
    `Size of Student Group at School` = ifelse(`Size of Student Group at School` == -1, NA, # denominator = -1 is masked data and unusable
                     `Size of Student Group at School`), 
    `Size of Student Group at School` = ifelse(`Size of Student Group at School` == -3, # denominator = -3 is second smallest group
                     nth(`Size of Student Group at School`, 2, order_by = `Size of Student Group at School`) / 2, 
                     `Size of Student Group at School`), # recover with reasonable proxy of half largest group
    ln_meets_denom = log(`Size of Student Group at School`)) %>% # log will be better for size parameter
  ungroup()


data[1:9,] %>%
  kbl(caption = "Structure of the Data Set: Showing the First Three Schools") %>%
  kable_paper("hover", full_width = F)
Structure of the Data Set: Showing the First Three Schools
cid Student Group Meets Grade Level (%) group_denom Size of Student Group at School ln_meets_denom
1902001 African American 71 aa_total 7 1.945910
1902001 Hispanic 58 h_total 19 2.944439
1902001 White 67 w_total 180 5.192957
1902041 African American 0 aa_total 5 1.609438
1902041 Hispanic 48 h_total 31 3.433987
1902041 White 60 w_total 282 5.641907
1902103 African American 50 aa_total 6 1.791759
1902103 Hispanic 75 h_total 12 2.484907
1902103 White 62 w_total 330 5.799093

Campus-Level Poverty Data

For campus-level poverty statistics, I use the public data available at the Texas Education Agency’s (TEA) school data visualizer, which includes an extensive set of campus-level attributes. I am primarily interested in the percentage of students from economically disadvantaged backgrounds, measured as the percentage of students “…Students eligible for free or reduced-price lunch or other public assistance as reported on the PEIMS October snapshot” (see definitions). Unfortunately, the source does not reveal precisely which year this data comes from! School-level poverty is a relatively stable attribute, though, so this is not a huge concern.

The challenge is that this data does not include the campus ID as a stand-alone variable. I must first extract it from a longer string to create a comparable cid variable that can be used to join this data set to the first one. Because there is no universal pattern to parse this string, I have to split on three different patterns and extract the rightmost element in order to recover all the campus identifiers. This is because some schools have more than one set of the ” (” pattern on the right-hand side of the “||” pattern. Each

school <- read_excel("docs/school.xlsx")
school <- school %>%
  transmute(
    Campus = `Campus or District`, 
    `Eco. Disadvantaged (%)` = `% Eco Disadvantaged`,
    `Campus Type` = factor(`Entity Type`),
    enrollment_public = Enrollment) %>%
  mutate(
    n = str_split(Campus, " \\|\\|", simplify = TRUE)[,2], # First split on "||"
    nn = str_split(n, " \\(", simplify = TRUE)[,2], # Next splot on " ("
    nnn = str_split(n, " \\(", simplify = TRUE)[,3], # A few have an additional " ("
    cid = as.numeric(gsub("\\D+", "", nn)),
    cid = ifelse(is.na(cid), as.numeric(gsub("\\D+", "", nnn)), cid)) %>%
  select(-starts_with("n"))

school[1:5,] %>%
  kbl(caption = "Structure of the Poverty Data after Recovering the Campus Identifier") %>%
  kable_paper("hover", full_width = F)
Structure of the Poverty Data after Recovering the Campus Identifier
Campus Eco. Disadvantaged (%) Campus Type enrollment_public cid
CAYUGA H S || CAYUGA ISD (001902001) 34.9 High School 166 1902001
CAYUGA MIDDLE || CAYUGA ISD (001902041) 40.6 Middle & Jr. High School 133 1902041
CAYUGA EL || CAYUGA ISD (001902103) 38.1 Elementary School 236 1902103
ELKHART H S || ELKHART ISD (001903001) 47.4 High School 344 1903001
ELKHART MIDDLE || ELKHART ISD (001903041) 45.1 Middle & Jr. High School 268 1903041

Codes: School, District, Region, County

This data set from TEA includes the codes that can be used for geographic identification in the plots.

codes <- read_csv("docs/school and district.csv")
codes <- codes %>%
  transmute(
    cid = as.numeric(gsub("[^[:alnum:] ]", "", `School Number`)),
    district = as.numeric(gsub("[^[:alnum:] ]", "", `District Number`)),
    county = as.numeric(gsub("[^[:alnum:] ]", "", `County Number`)),
    region = as.numeric(gsub("[^[:alnum:] ]", "", `ESC Region Served`)))

Join Data and See Descriptive Statistics

Taking the TAPR 2019 data as the base set, I match approximately 98.3% of campuses with the TEA school-level data by their campus id (cid). For parsimony, I exclude those %1.7% campuses that do not exist in both data sets. We can see that the masking and missing data results in approximately 10% of missing data from the TAPR 2019 data; however, due to the logic of how the masking is coded, those masked observations are likely to be smaller, relatively more homogeneous schools, otherwise, they would not have very small absolute numbers of students from any of the three racial/ethnic backgrounds.

#campus <- left_join(data, school, by = "cid") %>% left_join(codes, by = "cid")
campus <- left_join(school, codes, by = "cid") %>% right_join(data, by = "cid") %>%
  filter(!is.na(Campus))
dfSummary(campus,
          plain.ascii  = FALSE, 
          style        = "grid", 
          graph.magnif = 0.75, 
          valid.col    = FALSE,
          tmp.img.dir  = "/tmp",
          silent = TRUE)

Data Frame Summary

Dimensions: 23658 x 13
Duplicates: 0

No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 Campus
[character]
1. 3D ACADEMY || DONNA ISD (
2. A & M CONS H S || COLLEGE
3. A & M CONSOLIDATED MIDDLE
4. A B DUNCAN COLLEGIATE EL
5. A C BLUNT MIDDLE || ARANS
6. A C JONES H S || BEEVILLE
7. A C JONES HEALTH PROFESSI
8. A E BUTLER INT || QUINLAN
9. A G ELDER EL || JOSHUA IS
10. A LEAL JR MIDDLE || HARLA
[ 7876 others ]
3 ( 0.0%)
3 ( 0.0%)
3 ( 0.0%)
3 ( 0.0%)
3 ( 0.0%)
3 ( 0.0%)
3 ( 0.0%)
3 ( 0.0%)
3 ( 0.0%)
3 ( 0.0%)
23628 (99.9%)
0
(0.0%)
2 Eco. Disadvantaged (%)
[numeric]
Mean (sd) : 63.5 (26.1)
min < med < max:
0 < 67.5 < 100
IQR (CV) : 42.1 (0.4)
982 distinct values 0
(0.0%)
3 Campus Type
[factor]
1. Elementary School
2. High School
3. K-12 Schools
4. Middle & Jr. High School
12963 (54.8%)
4539 (19.2%)
1284 ( 5.4%)
4872 (20.6%)
0
(0.0%)
4 enrollment_public
[numeric]
Mean (sd) : 646.5 (538.7)
min < med < max:
1 < 526.5 < 5328
IQR (CV) : 395 (0.8)
1728 distinct values 0
(0.0%)
5 cid
[numeric]
Mean (sd) : 115822704 (72639619)
min < med < max:
1902001 < 101912278 < 254902101
IQR (CV) : 120998818 (0.6)
7886 distinct values 0
(0.0%)
6 district
[numeric]
Mean (sd) : 115857 (72676.9)
min < med < max:
1902 < 101912 < 254902
IQR (CV) : 120999 (0.6)
1193 distinct values 144
(0.6%)
7 county
[numeric]
Mean (sd) : 115 (72.7)
min < med < max:
1 < 101 < 254
IQR (CV) : 121 (0.6)
253 distinct values 144
(0.6%)
8 region
[numeric]
Mean (sd) : 9.7 (5.7)
min < med < max:
1 < 10 < 20
IQR (CV) : 9 (0.6)
20 distinct values 144
(0.6%)
9 Student Group
[character]
1. African American
2. Hispanic
3. White
7886 (33.3%)
7886 (33.3%)
7886 (33.3%)
0
(0.0%)
10 Meets Grade Level (%)
[integer]
Mean (sd) : 46.2 (18.7)
min < med < max:
0 < 45 < 100
IQR (CV) : 25 (0.4)
100 distinct values 2392
(10.1%)
11 group_denom
[character]
1. aa_total
2. h_total
3. w_total
7886 (33.3%)
7886 (33.3%)
7886 (33.3%)
0
(0.0%)
12 Size of Student Group at School
[numeric]
Mean (sd) : 376.6 (513.2)
min < med < max:
2.5 < 195 < 5395
IQR (CV) : 411 (1.4)
2329 distinct values 2539
(10.7%)
13 ln_meets_denom
[numeric]
Mean (sd) : 5.1 (1.4)
min < med < max:
0.9 < 5.3 < 8.6
IQR (CV) : 2 (0.3)
2329 distinct values 2539
(10.7%)

Exploring the Relationship between Race/Ethnicity, Poverty, and Education Outcomes

Here I write a function to build plotly interactive visualizations (hover over the image to get point-level data). School-level data is plotted on the x-axis based on poverty levels Eco. Disadvanted (%). School-level data disaggregated by race/ethnicity on the y-axis to show each racial/ethnic groups’s academic performance as Meets Grade Level (%). I write a custom function that takes the data and two filtering parameters to slice the full data set: campus %>% filter('Campus Type' == {{type}} & region == {{r}}). I pass this function to lapply() to generate any or all combinations of region and Campus Type (up to )

r <- 1:20

plots <- function(data, r, type) {

t <-  ggplot(data = campus %>% filter(`Campus Type` == {{type}} & region == {{r}}), 
         aes(label = `Campus`, label2 = `Size of Student Group at School`, x = `Eco. Disadvantaged (%)`, y = `Meets Grade Level (%)`)) +
  geom_line(alpha = .3, aes(group = cid)) +
  geom_point(alpha = .4, aes(size = `Size of Student Group at School`, color = `Student Group`)) +
  geom_smooth(size = 2, method = loess, color = "black", se = FALSE, aes(group = `Student Group`)) +
  geom_smooth(size = 1.8, method = loess, se = FALSE, aes(color = `Student Group`)) +
  guides(size = FALSE) +
  theme(legend.position = "bottom") +
  xlim(c(0,100)) + ylim(c(0,100)) + 
    ggtitle(paste("Region", {{r}}, {{type}}, "Performance by Race/Ethnicity and Poverty", sep = " "))
  
t[[r]] <- ggplotly(t, tooltip = c("label", "label2", "x", "y"))

#print(t[[r]])
#return(t[[r]])
}

# This works in R but not for markdown to html
# for (i in 1:20){
#   plots(campus, i)
# }

gg_hs <- lapply(r, plots, data = campus, type = "High School")

gg_es <- lapply(r, plots, data = campus, type = "Elementary School")
gg_es

[[1]]

[[2]]

[[3]]

[[4]]

[[5]]

[[6]]

[[7]]

[[8]]

[[9]]

[[10]]

[[11]]

[[12]]

[[13]]

[[14]]

[[15]]

[[16]]

[[17]]

[[18]]

[[19]]

[[20]]

gg_hs[[1]]
subplot(gg_hs[[10]], gg_es[[10]], gg_hs[[20]], gg_es[[20]], nrows = 2)
htmltools::tagList(list(gg_hs[[10]], gg_es[[10]], gg_hs[[20]], gg_es[[20]]))
htmltools::tagList(list(gg_hs[[1]], gg_hs[[3]], gg_es[[1]], gg_es[[3]]))

Private Schools Data 2018 - 2019

private <-  read_excel("docs/private schools 2018-2019.xlsx")
private <-private %>%
  filter(!`Grade High` %in% c("Pre-K", "Early Education") & Closed == FALSE) %>%
  transmute(
    district = as.numeric(`District Number`),
    county = as.numeric(`County Number`),
    region = as.numeric(`Region Name`),
    enrollment_private = as.numeric(Enrollment))

private_region <- private %>%
  group_by(region) %>%
  summarise(
    private_enrollment_region = sum(enrollment_private, na.rm = TRUE)) %>%
  ungroup()

public_region <- school %>%
  left_join(codes, by = "cid") %>%
  group_by(region) %>%
  summarise(
    public_enrollment_region = sum(enrollment_public, na.rm = TRUE)) %>%
  ungroup()
  
region <- left_join(private_region, public_region, by = "region") 
region <- region %>%
  mutate(
    `Total Enrollment` = private_enrollment_region + public_enrollment_region,
    `Private Students in Region (%)` = 100 * ( private_enrollment_region / `Total Enrollment`),
    Region_lab = paste("Region", region, sep = " "),
    Region = factor(region, ordered = TRUE, labels = Region_lab)) %>%
  arrange(desc(`Private Students in Region (%)`)) %>%
  mutate(
    Region_lab = paste("Region", region, sep = " "),
    `Region Ordered` = factor(region, levels = region, labels = paste("Region", region, sep = " ")))

#rm(private, private_region, public_region)

ggplotly(ggplot(region, aes(x = log(`Total Enrollment`), y = `Private Students in Region (%)`, label = region)) +
  geom_smooth(alpha = .4, method = lm) +
    geom_text(size = 5))
## `geom_smooth()` using formula 'y ~ x'
ggplotly(ggplot(region %>% filter(!region %in% c(1, 3)), aes(x = log(`Total Enrollment`), y = `Private Students in Region (%)`, label = region)) +
  geom_smooth(alpha = .4, method = lm) +
    geom_text(size = 5))
## `geom_smooth()` using formula 'y ~ x'
p1 <- ggplotly(ggplot(region, aes(x = `Private Students in Region (%)`, y = Region)) +
  geom_col())

p2 <- ggplotly(ggplot(region, aes(x = `Private Students in Region (%)`, y = `Region Ordered`)) +
  geom_col())

subplot(p1, p2, margin = 0.07)
---
title: "Ex. 1 Texas Public Schools"

---

Do: almost finished. Just make a couple of these. See how to hide code chunks for big ones of groups at end. See if there is a simple markdown solution to seeing dfsummary or remove the histograms.

# Campus-Level Data for 2019: 

This example highlights my data wrangling and visualization skills in an exploratory analysis of racial/ethnic inequity in public schools in Texas. This was a pet project that I created to satisfy my own curiosity on the topic and to get to know the publicly available data related to education in Texas. I clean, validate, and join several data sets with the end goal of exploring the relationship between race/ethnicity, economic disparity, and educational outcomes in Texas public schools. This page provides the steps required to go from several raw public data sets to produce the following interactive visualization, as well as several others, using the R packages listed below:

final example here using full code in echo = F chunk.

### Load Packages
```{r, message = FALSE, warning = FALSE}
library(tidyverse)
library(readxl)
library(plotly)
library(gapminder)
library(kableExtra)
library(summarytools)
```

### Percent of Students Meeting Grade-Level by Race/Ethnic Categories

For educational outcomes, I use the publicly available [Texas Academic Performance Report (TAPR) data](https://rptsvr1.tea.texas.gov/perfreport/tapr/2019/download/DownloadData.html) for the year 2019. This is a very large data set that that I have subsetted to focus on the percentage of students in three racial/ethnic categories (variable: `Student Group`) who meet the expectations of their respective grade levels (variable: `meets_level_pct`). I also include the numeric campus identifier `cid` and the respective denominators for each percentile value (variable: `meets_denom`), which informs the absolute number of students from each category at the school. I use this value to weight the observations in the Loess smoothing function and to inform the size parameter in the plots. For the original variable names, see the [TAPR Codebook](https://rptsvr1.tea.texas.gov/perfreport/tapr/2019/download/campstaar2b.html).  

As described in the [codebook for the data set](https://rptsvr1.tea.texas.gov/perfreport/tapr/2019/download/campstaar2b.html), student groups' percentile values with very small denominators are masked (coded as negative placeholder values of "-1") in order to preclude the possibility of identification of individuals in the data. I exclude these masked values from the data set, as well as any school that reports missing data for all three student groups. Denominators of the second smallest student group for each school are also masked with a unique placeholder ("-3") when the smallest group requires masking. I replace this set of masked values with a reasonable proxy of half the size of the next largest group in the data.  

This code does the following:  

1. Read in subsetted TAPR Data
2. Filter schools with all missing values
3. Pivot the data to include (up to) three observations per school, corresponding to the three percentile scores `meets_level_pct` of students in each racial/ethnic category that meet the grade-level expectations.
4. Create category names for student groups: `Student Group` and `group_denom`
5. Remove masked data for percentiles and recode the masked denominator with the proxy.
6. Create an additional variable `ln_meets_denom` of logged group size for use as weight and size in the plots.  

The result is 

```{r, message=FALSE, warning=FALSE}
campus_2019 <- read.csv("docs/TAPR_2019_subset.csv") %>%
  transmute(
    cid = CAMPUS,
    `African American` = CDB00A001219R,
    `Hispanic` = CDH00A001219R,
    `White` = CDW00A001219R,
    aa_total = CDB00A001019D,
    h_total = CDH00A001019D,
    w_total = CDW00A001019D) %>%
  # Filter out schools with no data (~ 10% of campuses)
  filter(!is.na(`African American`) | !is.na(`Hispanic`) | !is.na(`White`))

data <- cbind(
  pivot_longer(campus_2019 %>% select(1:4), cols = 2:4, names_to = "Student Group", values_to = "Meets Grade Level (%)"),
  pivot_longer(campus_2019 %>% select(5:7), cols = 1:3, names_to = "group_denom", values_to = "Size of Student Group at School")) %>%
  group_by(cid) %>%
  mutate(
    `Meets Grade Level (%)` = ifelse(`Meets Grade Level (%)` == -1, NA, `Meets Grade Level (%)`), # rate = -1 is masked data and unusable
    `Size of Student Group at School` = ifelse(`Size of Student Group at School` == -1, NA, # denominator = -1 is masked data and unusable
                     `Size of Student Group at School`), 
    `Size of Student Group at School` = ifelse(`Size of Student Group at School` == -3, # denominator = -3 is second smallest group
                     nth(`Size of Student Group at School`, 2, order_by = `Size of Student Group at School`) / 2, 
                     `Size of Student Group at School`), # recover with reasonable proxy of half largest group
    ln_meets_denom = log(`Size of Student Group at School`)) %>% # log will be better for size parameter
  ungroup()


data[1:9,] %>%
  kbl(caption = "Structure of the Data Set: Showing the First Three Schools") %>%
  kable_paper("hover", full_width = F)

               
```

## Campus-Level Poverty Data

For campus-level poverty statistics, I use the public data available at the Texas Education Agency's (TEA) [school data visualizer](https://rptsvr1.tea.texas.gov/perfreport/account/va/va_correlate.html), which includes an extensive set of campus-level attributes. I am primarily interested in the percentage of students from economically disadvantaged backgrounds, measured as the percentage of students "...Students eligible for free or reduced-price lunch or other public assistance as reported on the PEIMS October snapshot" ([see definitions](https://rptsvr1.tea.texas.gov/perfreport/faqsite/glossary.html)). Unfortunately, the source does not reveal precisely which year this data comes from! School-level poverty is a relatively stable attribute, though, so this is not a huge concern.  

The challenge is that this data does not include the campus ID as a stand-alone variable. I must first extract it from a longer string to create a comparable `cid` variable that can be used to join this data set to the first one. Because there is no universal pattern to parse this string, I have to split on three different patterns and extract the rightmost element in order to recover all the campus identifiers. This is because some schools have more than one set of the " (" pattern on the right-hand side of the "||" pattern. Each 


```{r}
school <- read_excel("docs/school.xlsx")
school <- school %>%
  transmute(
    Campus = `Campus or District`, 
    `Eco. Disadvantaged (%)` = `% Eco Disadvantaged`,
    `Campus Type` = factor(`Entity Type`),
    enrollment_public = Enrollment) %>%
  mutate(
    n = str_split(Campus, " \\|\\|", simplify = TRUE)[,2], # First split on "||"
    nn = str_split(n, " \\(", simplify = TRUE)[,2], # Next splot on " ("
    nnn = str_split(n, " \\(", simplify = TRUE)[,3], # A few have an additional " ("
    cid = as.numeric(gsub("\\D+", "", nn)),
    cid = ifelse(is.na(cid), as.numeric(gsub("\\D+", "", nnn)), cid)) %>%
  select(-starts_with("n"))

school[1:5,] %>%
  kbl(caption = "Structure of the Poverty Data after Recovering the Campus Identifier") %>%
  kable_paper("hover", full_width = F)

```

# Codes: School, District, Region, County

This data set from TEA includes the codes that can be used for geographic identification in the plots.

```{r, message = FALSE, warning = FALSE}
codes <- read_csv("docs/school and district.csv")
codes <- codes %>%
  transmute(
    cid = as.numeric(gsub("[^[:alnum:] ]", "", `School Number`)),
    district = as.numeric(gsub("[^[:alnum:] ]", "", `District Number`)),
    county = as.numeric(gsub("[^[:alnum:] ]", "", `County Number`)),
    region = as.numeric(gsub("[^[:alnum:] ]", "", `ESC Region Served`)))

```



## Join Data and See Descriptive Statistics

Taking the TAPR 2019 data as the base set, I match approximately 98.3% of campuses with the TEA school-level data by their campus id (`cid`). For parsimony, I exclude those %1.7% campuses that do not exist in both data sets. We can see that the masking and missing data results in approximately 10% of missing data from the TAPR 2019 data; however, due to the logic of how the masking is coded, those masked observations are likely to be smaller, relatively more homogeneous schools, otherwise, they would not have very small absolute numbers of students from any of the three racial/ethnic backgrounds.  

```{r, results = "asis"}
#campus <- left_join(data, school, by = "cid") %>% left_join(codes, by = "cid")
campus <- left_join(school, codes, by = "cid") %>% right_join(data, by = "cid") %>%
  filter(!is.na(Campus))
dfSummary(campus,
          plain.ascii  = FALSE, 
          style        = "grid", 
          graph.magnif = 0.75, 
          valid.col    = FALSE,
          tmp.img.dir  = "/tmp",
          silent = TRUE)
```


## Exploring the Relationship between Race/Ethnicity, Poverty, and Education Outcomes

Here I write a function to build `plotly` interactive visualizations (hover over the image to get point-level data). School-level data is plotted on the x-axis based on poverty levels `Eco. Disadvanted (%)`. School-level data disaggregated by race/ethnicity on the y-axis to show each racial/ethnic groups's academic performance as `Meets Grade Level (%)`. I write a custom function that takes the data and two filtering parameters to slice the full data set: `campus %>% filter('Campus Type' == {{type}} & region == {{r}})`. I pass this function to `lapply()` to generate any or all combinations of `region` and `Campus Type` (up to )

```{r, results = "asis", warning=F, message=F}


r <- 1:20

plots <- function(data, r, type) {

t <-  ggplot(data = campus %>% filter(`Campus Type` == {{type}} & region == {{r}}), 
         aes(label = `Campus`, label2 = `Size of Student Group at School`, x = `Eco. Disadvantaged (%)`, y = `Meets Grade Level (%)`)) +
  geom_line(alpha = .3, aes(group = cid)) +
  geom_point(alpha = .4, aes(size = `Size of Student Group at School`, color = `Student Group`)) +
  geom_smooth(size = 2, method = loess, color = "black", se = FALSE, aes(group = `Student Group`)) +
  geom_smooth(size = 1.8, method = loess, se = FALSE, aes(color = `Student Group`)) +
  guides(size = FALSE) +
  theme(legend.position = "bottom") +
  xlim(c(0,100)) + ylim(c(0,100)) + 
    ggtitle(paste("Region", {{r}}, {{type}}, "Performance by Race/Ethnicity and Poverty", sep = " "))
  
t[[r]] <- ggplotly(t, tooltip = c("label", "label2", "x", "y"))

#print(t[[r]])
#return(t[[r]])
}

# This works in R but not for markdown to html
# for (i in 1:20){
#   plots(campus, i)
# }

gg_hs <- lapply(r, plots, data = campus, type = "High School")

gg_es <- lapply(r, plots, data = campus, type = "Elementary School")



```


```{r, results = "asis", warning=F, message=F}

gg_es

gg_hs[[1]]

subplot(gg_hs[[10]], gg_es[[10]], gg_hs[[20]], gg_es[[20]], nrows = 2)

htmltools::tagList(list(gg_hs[[10]], gg_es[[10]], gg_hs[[20]], gg_es[[20]]))
htmltools::tagList(list(gg_hs[[1]], gg_hs[[3]], gg_es[[1]], gg_es[[3]]))

```


# Private Schools Data 2018 - 2019

```{r}
private <-  read_excel("docs/private schools 2018-2019.xlsx")
private <-private %>%
  filter(!`Grade High` %in% c("Pre-K", "Early Education") & Closed == FALSE) %>%
  transmute(
    district = as.numeric(`District Number`),
    county = as.numeric(`County Number`),
    region = as.numeric(`Region Name`),
    enrollment_private = as.numeric(Enrollment))

private_region <- private %>%
  group_by(region) %>%
  summarise(
    private_enrollment_region = sum(enrollment_private, na.rm = TRUE)) %>%
  ungroup()

public_region <- school %>%
  left_join(codes, by = "cid") %>%
  group_by(region) %>%
  summarise(
    public_enrollment_region = sum(enrollment_public, na.rm = TRUE)) %>%
  ungroup()
  
region <- left_join(private_region, public_region, by = "region") 
region <- region %>%
  mutate(
    `Total Enrollment` = private_enrollment_region + public_enrollment_region,
    `Private Students in Region (%)` = 100 * ( private_enrollment_region / `Total Enrollment`),
    Region_lab = paste("Region", region, sep = " "),
    Region = factor(region, ordered = TRUE, labels = Region_lab)) %>%
  arrange(desc(`Private Students in Region (%)`)) %>%
  mutate(
    Region_lab = paste("Region", region, sep = " "),
    `Region Ordered` = factor(region, levels = region, labels = paste("Region", region, sep = " ")))

#rm(private, private_region, public_region)

ggplotly(ggplot(region, aes(x = log(`Total Enrollment`), y = `Private Students in Region (%)`, label = region)) +
  geom_smooth(alpha = .4, method = lm) +
    geom_text(size = 5))
ggplotly(ggplot(region %>% filter(!region %in% c(1, 3)), aes(x = log(`Total Enrollment`), y = `Private Students in Region (%)`, label = region)) +
  geom_smooth(alpha = .4, method = lm) +
    geom_text(size = 5))

p1 <- ggplotly(ggplot(region, aes(x = `Private Students in Region (%)`, y = Region)) +
  geom_col())

p2 <- ggplotly(ggplot(region, aes(x = `Private Students in Region (%)`, y = `Region Ordered`)) +
  geom_col())

subplot(p1, p2, margin = 0.07)

```















